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We introduce hmmSeq, a model-based hierarchical Bayesian tech¬ 
nique for detecting differentially expressed genes from RNA-seq data. 

Our novel hmmSeq methodology uses hidden Markov models to ac¬ 
count for potential co-expression of neighboring genes. In addition, 
hmmSeq employs an integrated approach to studies with techni¬ 
cal or biological replicates, automatically adjusting for any extra- 
Poisson variability. Moreover, for cases when paired data are avail¬ 
able, hmmSeq includes a paired structure between treatments that 
iircoporates subject-specific effects. To perform parameter estimation 
for the hmmSeq model, we develop an efficient Markov chain Monte 
Carlo algorithm. Further, we develop a procedure for detection of dif¬ 
ferentially expressed genes that automatically controls false discovery 
rate. A simulation study shows that the hmmSeq methodology per¬ 
forms better than competitors in terms of receiver operating charac¬ 
teristic curves. Finally, the analyses of three publicly available RNA- 
seq data sets demonstrate the power and flexibility of the hmmSeq 
methodology. An R package implementing the hmmSeq framework 
will be submitted to CRAN upon publication of the manuscript. 

1. Introduction. RNA-seq has revolutionized the study of gene expres¬ 
sion. RNA-seq success may be attributed to its low noise, high-throughput 
and ability to interrogate allele-specific expression and isoforms [Zhao et al. 
(2014), Auer, Srivastava and Doerge (2012)]. Most RNA-seq studies aim to 
identify differentially expressed (DE) genes between samples corresponding 
to different treatments or biological conditions, for example, cancer tissue 
versus normal tissue, genetically engineered animals versus control animals, 
or patients exposed to two or more kinds of treatments. These differentially 
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expressed genes usually form the starting point of more extensive studies 
such as integration of expression data with transcription factor binding [Kar- 
lebach and Shamir (2008)], RNA interference [Pe’er and Hacohen (2011)] 
and DNA methylation [Louhimo and Hautaniemi (2011)], all of which can 
lead to a better understanding of regulatory mechanisms. Currently avail¬ 
able methods for RNA-seq data analysis assume that differential expression 
of genes occurs independent of the genomic loci of each gene [Auer and Do- 
erge (2011), Robinson and Smyth (2007, 2008), Hardcastle and Kelly (2010), 
Robinson, McCarthy and Smyth (2010), Si and Liu (2013)]. However, the lit¬ 
erature contains evidence that neighboring genes on the chromosome tend to 
be co-expressed [Caron et al. (2001), Singer et al. (2005), Michalak (2008)]. 
To account for and take advantage of this potential co-expression, here we 
introduce hmmSeq. 

Our hmmSeq framework incorporates potential co-expression by model¬ 
ing differential expression across the genome using hidden Markov models 
(HMM). In the hmmSeq framework that we propose, neighboring gene co¬ 
expression may occur in two ways: in differential expression across treat¬ 
ments and in mean expression magnitude. Thus, we model gene differential 
expression across treatments using an HMM with three states: not differ¬ 
entially expressed, under- or over-expressed. This HMM takes advantage 
of the potential co-differential expression by borrowing information across 
neighboring genes on the chromosome. In addition, we model gene mean 
expression magnitude with an HMM with two states: low expression and 
high expression. The latter HMM borrows information across the genome 
to increase estimation precision of the mean expression magnitude of each 
gene. As we show in the simulation study in Section 4, the use of informa¬ 
tion both from neighboring genes and across the genome increases detection 
power and reduces false discovery. 

The existing methods for RNA-seq data analysis do not account for the 
potential co-expression of neighboring genes. Robinson and Smyth (2007, 
2008) use the negative binomial distribution to model over-dispersed data 
through dispersion parameters. Specifically, Robinson and Smyth (2008) as¬ 
sume a common dispersion parameter across all tags (or genes), whereas 
Robinson and Smyth (2007) assume tag-wise (or gene-wise) dispersion pa¬ 
rameters. To estimate these dispersion parameters, they assume a Gaussian 
hierarchical hyperprior that is estimated using empirical Bayes. After that, 
the gene-wise dispersion parameters are estimated by maximum weighted 
likelihood. This method is implemented in edgeR [Robinson, McCarthy and 
Smyth (2010)]. The baySeq method of Hardcastle and Kelly (2010) is an 
empirical Bayes approach that is also based on the negative binomial dis¬ 
tribution. An empirically determined prior distribution is derived from the 
entire data set, and rather than producing significance values, this method 
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calculates posterior probabilities of multiple models of differential expres¬ 
sion, ranking the genes by the model probabilities. Blekhman et ah (2010) 
analyze RNA-seq data by a Poisson generalized linear mixed-effect model, 
which explains inter-individual variability through the inclusion of a ran¬ 
dom individnal-specific effect. Data are fitted under the null and alternative 
models gene by gene, then a likelihood ratio test is conducted to compute 
p-values, and the false discovery rate (FDR; defined as the proportion of 
incorrect calls among the genes declared as DE) is controlled by the method 
of Storey and Tibshirani (2003). Auer and Doerge (2011) have proposed the 
two-stage Poisson model (TSPM) which assumes data contain both overdis¬ 
persed and nonoverdispersed genes. This technique seeks to reduce FDR by 
first separating the overdispersed genes from the nonoverdispersed genes, 
and then fitting separate models to compute the p-values. Benjamini and 
Hochberg (1995) FDR controlling is applied on each set of p-values to iden¬ 
tify DE genes. Si and Liu (2013) developed a test for the hypothesis that the 
log fold change belongs to a subset of the real line. By assuming parameters 
under a null and alternative hypothesis come from different distributions, 
they estimate this mixture distribution from the data, then the test statis¬ 
tic is obtained as the ratio of unconditional probability from null parame¬ 
ter space over unconditional probability from full space. All these previous 
methods assume that the genes are conditionally independent. However, the 
exploratory data analysis we present in Section 2.2 suggests dependence 
among neighboring genes. Our hmmSeq method addresses this dependence. 

Our hmmSeq framework may also accommodate the case when there is no 
dependence among the expression of neighboring genes. Specifically, HMMs 
include as particular cases mixture models. In particular, the number of 
components of the mixture model will be the same as the number of states 
in the HMM. Thus, when there is no co-differential expression, the result will 
be a mixture model with three components that correspond to a gene being 
not differentially expressed, under- or over-expressed. Likewise, when there 
is no dependence in mean expression magnitude among neighboring genes, 
the resulting mixture model will have two components, one component for 
low expression genes and another for high expression genes. Note that the 
proportion of genes in each component and the parameters of the generating 
model for each component will be estimated from the data. Thus, even with¬ 
out neighboring genes dependence, the hmmSeq framework will still borrow 
information across the genome to learn abont each of the mixture compo¬ 
nents and, as a result, increase estimation precision and detection power. 

We model extra-Poisson variability in an indirect manner. If the exper¬ 
iment contains technical replicates (i.e., samples from the same snbject), 
then the literature provides evidence that the RNA-seq counts are Poisson 
distributed [Marioni et al. (2008), Bullard et al. (2010)]. On the other hand, 
if the experiment contains biological replicates, then the RNA-seq counts 
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will have extra-Poisson variability [Langmead et al. (2010), Robinson and 
Smyth (2007)]. This extra-Poisson variability may be a result of across- 
subjects variability or slight differences in the experimental conditions when 
the samples were taken or analyzed. While the RNA-seq literature usually 
uses the negative binomial distribution to model the extra-Poisson variabil¬ 
ity, another way to deal with this extra-variability is through the use of 
the Poisson distribution together with random effects. We prefer the latter 
because it provides a framework that can flexibly deal with known sources 
of extra-Poisson variability such as, for example, biological variation among 
subjects. In the case of paired data considered in Section 5.3, we deal with 
the biological variation by including for each gene subject-specific random 
effects. Moreover, for nonpaired data we implicitly deal with the subject- 
specific random effects (and any other source of extra-Poisson variability) 
by assuming that for nondifferentially expressed genes the differential treat¬ 
ment effect parameter may come from a normal distribution centered at 
zero. Hence, for nondifferentially expressed genes the differential treatment 
effect parameter is a sum of the random effects of subjects and other hidden 
sources. In addition to facilitating the implementation of our HMM frame¬ 
work, this aspect of our model increases robustness with respect to hidden 
unforeseen sources of variability. 

We have investigated in three fronts the practical usefulness and adequacy 
of HMMs and mixture components models for RNA-seq data analysis. First, 
we have performed an exploratory data analysis presented in Section 2.2 that 
studies for two real RNA-seq data sets the empirical statistical properties 
of preliminary estimates of differential expression parameters and mean ex¬ 
pression magnitude parameters. Two patterns emerge from this exploratory 
data analysis: the existence of three differential expression states and of two 
mean expression magnitude states; and a possible dependence across neigh¬ 
boring genes. Second, we have performed a simulation study that considers 
all four possible combinations of HMMs and mixture components models 
for differential expression and mean expression magnitude. This simulation 
study compares the performance of our hmmSeq framework with compet¬ 
ing RNA-seq analysis methodologies. In all four possible cases, our hmm¬ 
Seq framework beats the competing methods in terms of receiver operating 
characteristic curves. Finally, we have used the deviance information crite¬ 
rion (DIG) [Spiegelhalter et al. (2002)] to decide among the four possible 
combinations of HMMs and mixture components models what is the most 
adequate model for each of three real RNA-seq data sets. Our use of the 
DIG is justified by its good performance in a simulation study presented in 
Section 4. The DIG indicates dependence across neighboring genes for two of 
the three data sets. Therefore, in this paper we provide further evidence that 
for some biological processes neighboring genes on the chromosome tend to 
be co-expressed. 
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We take a full Bayesian analysis approach and develop a Markov chain 
Monte Carlo algorithm to exploit the posterior distribution of the model 
parameters. To simulate the differential effects and the mean effect mag¬ 
nitudes, we develop an efficient Metropolis-Bastings algorithm for hidden 
Markov models. In addition, we use the output of the MCMC algorithm to 
identify differentially expressed genes while controlling for false discovery 
rate. Specifically, we use a Bayesian approach for controlling the FDR level 
proposed by Newton et al. (2004) and further studied by Muller, Parmigiani 
and Rice (2007). We demonstrate the advantages and benefits of our hmm- 
Seq methodology by analyzing three RNA-seq data sets. The first data set 
[Marioni et al. (2008)] consists of hve technical replicates each of a kidney 
and liver RNA sample. The second data set [Zeng et al. (2012)] consists of six 
biological replicates extracted from two regions, frontal pole and hippocam¬ 
pus, of normal human brains. Finally, the third data set [Henn et al. (2013)] 
consists of paired B-cell samples data of day 0 (before vaccination) and day 7 
(post-vaccination) for 3 pre-vaccinated subjects. Therefore, we demonstrate 
the power and flexibility of the hmmSeq methodology on three types of 
RNA-seq data: technical replicates, biological replicates and paired samples. 

The remainder of the paper is organized as follows. Section 2 describes the 
details of the hmmSeq model and informally demonstrates the necessity for 
hidden Markov models in Section 2.2. Section 3 describes the posterior in¬ 
ference procedure based on Markov chain Monte Carlo (MCMC) techniques 
and the procedure for identification of DE genes. Section 4 uses simulated 
data to demonstrate the effectiveness of hmmSeq relative to well-known 
techniques for RNA-seq. The Marioni et al. (2008), Zeng et al. (2012) and 
Henn et al. (2013) data sets are analyzed in Sections 5.1, 5.2 and 5.3. In 
all cases, the results are compared and contrasted with those of existing 
approaches to demonstrate the success of hmmSeq. A functional analysis of 
the detected sets of DE genes provides further evidence of the reliability of 
the procedure. An R package implementing the hmmSeq framework will be 
submitted to CRAN upon publication of the manuscript. 

2. A Bayesian hierarchical model for RNA-seq data. We focus on two- 
treatment comparisons. For a given chromosome c, let denote the 

integer-valued gene read of the /cth replicate of gene i under treatment j, for 
gene i = 1,2,..., /c, treatment j = 1,2, and replicate k = 1,2,..., Kj on chro¬ 
mosome c. The genes are sequentially arranged so that consecutive indices 
correspond to neighboring genes on the chromosome. We assume that 

Yijkc Poisson(Aijfcc) where 
(2.1) log(Ajifc(;) — Pic ^ic T Pik and 

log(Aj2fcc) = Pic + ‘^ic-k P2k, 
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where /3jc denotes the mean expression magnitude of gene i and 2Ajc de¬ 
notes the log-fold change between the treatments. The treatment-specific 
replicate effects are represented by pjk- We observe that the treatments are 
a priori interchangeable in equation (2.1). The differential treatment effect 
Aic for gene i is key because it determines the relative expression levels of 
the treatments for the gene. That is, Ajc determines whether treatment 2 is 
over-, under-, or nondifferentially expressed with respect to treatment 1. 

To model the mean expression magnitude, the possible dependence among 
the /3ic’s of neighboring genes on a chromosome is modeled using either a 
two-component hnite mixture model [Titterington, Smith and Makov (1985), 
Friihwirth-Schnatter (2006)] or a stationary two-state hidden Markov model 
[Rabiner (1989), MacDonald and Zucchini (1997)]. The latent average ex¬ 
pression state Sic determines whether the expression of gene i, averaged over 
treatments and replicates, is “small” (sjc = 1) or “large” (sjc = 2). In the 
absence of differential treatment and replicate effects, the two levels of this 
categorical variable correspond, respectively, to low and high reads for the 
genes. Conditional on Sic, the average expression ffc is normally distributed: 




I indep 
\Sir ~ 





with pic < iJ, 2 c- The latent states sic,..., sj^c follow either a hnite mixture 
model (FMM) with probability vector Pc = {pic,P 2 c) or a hidden Markov 
model (HMM) with stationary transition probability matrix Ac = {{autc)) 2 x 2 
with the row sums 2 ^utc = 1 for tt = 1,2. We denote the two-component 
FMM by J- 2 c and the two-state HMM by 7^20 assuming independent, non- 
informative priors for its dispersion parameters: p(u^c) ^uc ^ ~ 2. 

To model differential expression, the differential effects Aic,..., A/^c of the 
genes are modeled either by a hnite mixture model (FMM) J^ 3 c with proba¬ 
bility vector Qc = {qic q2c: Qsc) or by a three-state stationary HMM denoted 
by 773c; the matrix of transition probabilities is denoted by Be = {{bvtc)) 3 x 3 
with row sums Ylt=i^ytc = 1 for u = 1,2,3. With latent differential states 
hic, ■ ■ ■ ,hi^c taking values in {1,2,3}, the values correspond, respectively, 
to the gene-specihe under-, nondifferential-, and over-expression of treat¬ 
ment 2 relative to treatment 1. Given the state ffc, differential effect Ajc is 
distributed as 


( 2 . 2 ) 


Aic\hic ~ 


' N{(pic,Tfj, 
^ ^(0,t|c)> 
^N{(l)3c,ric), 


if hie = 1 (under-expressed), 

if hie = 2 (nondifferentially-expressed), 

if hie = 3 (over-expressed), 


where 4>ic < 0 and (p^c > 0. Thus, for each chromosome, hie ,..., hj^c are the 
parameters of interest because they identify the set of DE genes. 

We observe that the latent states of both FMM T 2 c and HMM 772c are 
nonexchangeable, being associated with particular biological conditions. The 
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priors for the state parameters are designed to reflect this and also to pre¬ 
vent label switching [Scott ( 2002 )]. Specifically, the mean parameters, /iic 
and /i 2 c are assigned the prior tJ- 2 c) oc 1 {/xic<m 2 c-< 5 } where <5 > 0 is a 

predetermined constant. The fact that 5 is strictly positive guarantees that 
fJ-ic < /^2c and the two states are identifiable. 

For the same reason, for the FMM and HMM Hsc, we assume that 
Pihchc) oc where ui < 0 and ^3 > 0 are prespecified con¬ 

stants that can be chosen as follows. The log-fold change between the over- 
and under-expressed categories is at least {I 3 — ui). From a practical stand¬ 
point, in order to distinguish between these two categories, it is reasonable 
to assume that the ratios of their associated Ajc’s exceed 2. Because of this, 
we symmetrically set ui = —(log2)/2 and I 3 = (log2)/2. To further facilitate 
inferences of the state-specific parameters, informative conjugate priors are 
assigned to rf^, and 

2.1. Paired data analysis. Our hmmSeq framework may also accommo¬ 
date paired data, that is, the case when each subject undergoes each of the 
treatments. Here we describe the minor changes needed for that purpose. 
For a given chromosome c, let denote the gene read of the kth subject 
of gene i under treatment j, for subject k = 1,..., K. Obviously, because of 
the paired data structure there exists dependence between observations on 
the same subject. To account for this dependence, we assume that 

Yijkc Poisson(Ajjfcc) where 
(2.3) log(Aiifcc) = Ac - Ajc-L Ejfcc + Pifc and 

log(Aj 2 fcc) — (die T Ajc -|- Eikc T P 2 ki 

with Eikc N{0,a‘^) denoting the subject-specific random effects. The other 
parameters in the paired-data model have the same interpretations as in 
equation ( 2 . 1 ). 

2.2. Exploratory data analysis. We have performed an exploratory data 
analysis (EDA) to verify some of the hmmSeq model assumptions for the 
Marioni et al. (2008) and Zeng et al. (2012) data sets. Both data sets possess 
the feature described by Bullard et al. (2010); for libraries under each treat¬ 
ment, 5% of the genes account for over 50% of the total library size, and 
10% of the genes account for over 60% of the total library size. Focusing only 
on those genes whose reads exceeded nine for all libraries, and ignoring the 
replicate effects, we preprocessed the raw counts using the upper-quartile 
normalizing technique of Bullard et al. (2010). 

For the genes i = 1,2,... ,Ic of each chromosome, we have computed pre¬ 
liminary estimates of the expression magnitude Ac differential effect 
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N = 306 Bandwidth = 0.4411 N = 1174 Bandwidth = 0.07349 

(a) (b) 

Fig. 1. Marioni et al. (2008) data set—density plots of preliminary estimates for the 
expression magnitudes /3 and the differential expression effects A. (a) Density plot of fti 
for chromosome 13, (b) density plot of Ai for chromosome 19. 


Aic by treating these parameters as the fixed effects in a Poisson regres¬ 
sion model. Figure 1 displays graphical summaries of these estimates for a 
few chromosomes of the Marioni et al. (2008) data set. The results were sim¬ 
ilar for the other chromosomes. The density plot for the /?i’s in Figure 1(a) 
and for the Aj’s in Figure 1(b) are indicative of mixture of densities rep¬ 
resentations for these parameters. Further, data analysis in Section 5.1 will 
select the dependence structures of /3i’s and Aj’s by DIG model selection. 

For the Zeng et al. (2012) data set, Figure 2 displays density plots for pre¬ 
liminary estimates of the model parameters and also reveals a similar pattern 
as the Marioni et al. (2008) data set. The above analysis, together with data 
analysis in Section 5.2, suggests the need for mixture models, with first order 
dependence to model the expression magnitudes /3jc and differential effects 


A 

P 




N = 222 Bandwidth = 0.5159 N = 533 Bandwidth = 0.06036 

(a) (b) 


Fig. 2. Zeng et al. (2012) data set—density plots of preliminary estimates for the ex¬ 
pression magnitudes (3 and the differential expression effects A. (a) Density plot of jSi for 
chromosome 13, (b) density plot of Ai for chromosome 9. 
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Aic, justifying the use of the hidden Markov models 'H 2 c and 'H^.c in the 
hmmSeq method. 

3. Posterior inference. We investigate the posterior distribution of the 
chromosome-specific parameters using Markov chain Monte Carlo (MCMC) 
methods. Gibbs sampling cannot be applied to generate the parameters in 
equation (2.1) because the Poisson likelihood function is not conjugate to 
the normal priors of the parameters. Consequently, we apply the Laplace 
approximation [e.g., Zeger and Karim (1991), Chib and Greenberg (1994)] to 
generate proposed updates for the equation (2.1) parameters. The proposals 
are accepted or rejected by a Metropolis-Hastings probability to compensate 
for the use of an approximation instead of the Poisson distribution. This 
guarantees the convergence of the Markov chain to the posterior distribution 
of the hmmSeq model. We analyze each chromosome separately and, for 
simplicity of notation, in this section we omit the chromosome index c. 

3.1. Metropolis-Hastings algorithm for HMMs. In this section we present 
a Metropolis-Hastings algorithm for the simulation of a general latent pro¬ 
cess {9i,i = 1,...,/} that follows an HMM Hm- We use this algorithm in 
Section 3.2 to simulate the expression magnitude j3 and the differential ef¬ 
fects A. Under a Laplace approximation, the working values of the read 
counts are defined as 

(3.1) Wijk = log(Ayfc) -L 

'^ijk 

These working values have an approximate normal distribution, specifically 

Wijk N{log{Xijk), 1/Xijk)- 

For a more general case, suppose 9i is the parameter of interest, and its 
value at the previous MCMC iteration was . Then, the Laplace approx¬ 
imation (3.1) gives us Wijk = log (AL^,) -f {Yijk - X*^^f)/X*^^ N{\og{Xijk), 
^/Kjk)-: where log (A^fc) = fijk + Zijk • 9i and log (AL^) = (,ijk + Zijk ■ 

Let the vectors w* = (rcjii,'Wii2, • • •, \ = (Ajii, Aii2, ■ • •, Ai 2 K 2 )' 

and K = Then w, iV(log(A,),Diag(l/A*)). 

Defining 

* Sj=l Sfc=l \jk[^ijk{Wijk — f,ijk)] 

in: = - - - - - 

* 2 ’ 

A^j=l /-jk=l ‘^ijk^ijk 

we have that w* is sufficient for 9i and w*\9i N{9i, 1/ Ylk=i ^fjkKjk)- 

Further assume that the prior of 5 is an m-state hidden Markov model {Hm) 
with transition matrix Cm, and 9i\hi = t ^ N{vt,Kl), for f = l,2,...,m. 
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where hi is the hidden state for We marginalize over 6i to obtain the 
approximate likelihood function 

p{w*\hi = t) iVK? + 1 / ^^zfjkKjk 

V j=i k=i 

(3.2) 

where t = 1 , 2 ,..., m. 

The conditional prior probability, P{hi = t\hj,j 7 ^ i), for t = 1,2,3, can be 
computed from the transition matrix. Cm, of the HMM Pm- 

The normalized product of the conditional prior probability and approx¬ 
imation (3.3) gives the approximate full conditional distribution of the dif¬ 
ferential state hi, from which we propose a new value, h^P°'^K We then pro¬ 
pose a new value, from the approximate full conditional of 9i given 

hi = The proposed values are jointly accepted or re¬ 

jected by a Metropolis-Hastings probability [Gamerman and Lopes (2006)] 
to ensure that the post-burn-in MCMC samples represent draws from model 
posterior. 

3.2. MCMC procedure. We iteratively generate MCMC samples of the 
chromosome-specihc parameters by the following procedure: 

1. The differential effects Ai,..., A/ and latent differential states hi,... ,hj 
are generated as in Section 3.1, given the expression magnitudes /3, subject- 
specific effects e (set to be 0 for nonpaired data) and treatment-replicate 
effects p. 

2. The mean expression magnitudes , 01 ,..., /?/ and latent states si,..., s/ 
are generated as in Section 3.1, given the differential effects A, subject- 
specific effects e (set to be 0 for nonpaired data) and treatment-replicate 
effects p. 

3. For paired data, the subject-specific effects Sik for i = 1,... ,I and 
k = 1,... ,K are also generated by a similar Laplace approximation and 
Metropolis-Hastings procedure as in step 1. 

4. Conditional on the mean expression magnitudes /3i,..., /I/, latent states 
si,... ,si, and the fact that P 2 — hi > d, the hyperparameters pi and p ,2 are 
jointly sampled from the restricted bivariate normal distribution using the 
R package tmvtnorm [Wilhelm and Manjunath (2013)]. 

5. For latent states h = 1,2,3, s = 1,2, the hyperparameters 4>h, 

are all generated from their full conditional distributions by Gibbs sampling 
steps. 

3.3. Detection of DE genes. For each chromosome, interest focuses on 
the latent vector of differential states hi,... ,hi, where, as dehned in equa¬ 
tion (2.2), hi = 1 {hi = 3) represents an under-expressed (over-expressed) 
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gene in treatment 2. We use the MCMC samples of the differential states to 
identify the DE genes while controlling for false discovery rate. Specifically, 
we use a Bayesian approach for controlling the FDR level first proposed by 
Newton et ah (2004), further studied by Muller, Parmigiani and Rice (2007), 
and subsequently applied in RNA-seq analysis by Lee et al. (2011). 

Let go be the desired nominal FDR level. In addition, let rj G {0,1} rep¬ 
resent the unknown truth that gene i is differentially expressed (r* = 1) or 
nondifferentially expressed {ri = 0). Further, let pi be the posterior prob¬ 
ability that gene i is differentially expressed. Last, let 6i G {0,1} denote 
the decision of calling gene i differentially expressed {6i = 1) or nondifferen¬ 
tially expressed {di = 0). Using the MCMC output, we compute the estimate 
Pi = Pr(rj = l|data) for genes z = 1,..., on all chromosomes. 

A possible decision is to flag all genes with pi greater than or equal to a 
certain threshold po- The resulting FDR would then be equal to 


D)l(33i>po) 


(3.3) 


FDR = 




Hence, the posterior expected FDR would be 
(3.4) FDR= 

^{pi>Po) 

Alternatively and more effectively than assigning a prespecified thresh¬ 
old Po, we may control the nominal FDR level go [Newton et al. (2004), 
Muller, Parmigiani and Rice (2007)]. Specifically, first we rank genes in de¬ 
creasing order of pt. Denote the ordered estimated posterior probabilities by 
P{i) > P{ 2 ) > ' " > P{N)- Thus, if we declare as differentially expressed the set 
of genes such that pi > p(^), for each d= 1,... ,N, then the corresponding 
posterior expected FDR will be 


(3.5) 


FDRd 


Z^i=l(l Pi)^{Pi>P(d)) 

2-ji=i ^(pi>P(d)) 


Eti(i-P(o) 

d 


Finally, the decision rule for detecting DE genes is to flag all genes with 
FDRrf < go. 


4. Simulation study. To compare the accuracy of hmmSeq with exist¬ 
ing RNA-seq techniques, we performed two simulation studies. In the first 
study, the data were generated from a Poisson distribution. In the second 
simulation study, we generated the data from a negative binomial distri¬ 
bution. Three popular RNA-seq techniques were considered for compar¬ 
isons; edgeR [Robinson, McCarthy and Smyth (2010)], baySeq [Hardcas- 
tle (2009)], and TSPM [Auer and Doerge (2011)]. The methods edgeR 
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Table 1 

Simulation study—parameters for generation of (3 




Normal components 

HMM transition matrix 

FMM probability 

/3 

Ml 

O'! 

M2 


A 

P 






/0.50 0.50\ 



1 

0.37 

3.91 

2.4 

1^0.05 0.95 ) 

(0.1,0.9) 


and baySeq have been implemented in R packages publicly available at 
http://www.bioconductor.org. R code for TSPM can be downloaded from 
http://www.stat.purdue.edu/'doerge/software/TSPM.R. The R code for 
hmmSeq is available in the Supplementary Materials [Cui et al. (2015)]. 

We first consider a simulation study with data generated from a Poisson 
distribution. For each of the following simulations, read counts were simu¬ 
lated for 12 chromosomes having 800 genes each, resulting in a total of 9600 
genes. Six replicates of the set of read counts were generated for each of 
the two treatments. The replicate effects were assumed to be equal to the 
estimates for the biological replicates data of Zeng et al. (2012). 

In the simulation study, the gene-specific magnitude factors /3 and the dif¬ 
ferential expression factors A were generated either from the hidden Markov 
model or finite mixture model with hyperparameters values given in Ta¬ 
bles 1 and 2. The hyperparameters of the normal components were chosen 
to match the estimates for the biological replicates data of Zeng et al. (2012). 
The other hyperparameters were chosen according to our experience working 
with hidden Markov models [Guha, Li and Neuberg (2008)]. Let “F” denote 
a finite mixture model and “H” denote a hidden Markov model. We consider 
each model resulting from each possible combination of a FMM or a HMM 
for (3 and a FMM or a HMM for A in a total of 4 possible models. We 
denote each model with FF, FH, HF and HH, with the first letter indicating 
the process for (3 and the second letter indicating the process for A. 

We perform model selection with the deviance information criterion (DIG) 
[Spiegelhalter et al. (2002)]. To evaluate the ability of DIG to discriminate 
among the four competing models, we have performed a simulation study. 


Table 2 

Simulation study—parameters for generation of A 




Normal components 


HMM transition matrix 

FMM probability 

A 

(j>i 



4>3 

vl 


B 


Q 







/0.50 

0.25 

0.25 \ 



-0.4 

0.013 

0.01 

0.4 

0.013 

0.10 

0.80 

0.10 

(0.22,0.56,0.22) 







\0.25 

0.25 

0.50 / 
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Table 3 

Simulation study—performance of DIC-based model selection 





DIG chosen model 


FF 

FH 

HF 

HH 

True model 

FF 

0.73 

0.10 

0.07 

0.10 


FH 

0.00 

0.77 

0.00 

0.23 


HF 

0.07 

0.03 

0.80 

0.10 


HH 

0.00 

0.10 

0.00 

0.90 


Specifically, for each of the 4 possible models, we have simulated 30 data sets. 
After that, we have analyzed each simulated data set with the 4 different 
hmmSeq models: FF, FH, HF and HH. Then, for each simulated data set we 
have conducted DIG model selection. Table 3 presents for each true model 
the proportion of times that DIG has chosen each of the 4 competing models. 
As we can see from Table 3, the DIG chooses the correct model most of the 
time. 

To compare hmmSeq with the other competing RNA-seq analysis meth¬ 
ods, we consider their receiver operating characteristic (ROG) curves. The 
ROG curve of each method describes the relationship between the true pos¬ 
itive rate (TPR) and the false positive rate (FPR) of gene detection. The 
TPR (which is also known as the sensitivity) is defined as the proportion 
of truly DE genes that are detected by the method. The FPR is defined 
as the proportion of non-DE genes that are erroneously identified as DE. 
The greater the area under the ROG curve, the greater the reliability of the 
method in detecting DE genes. For each simulation setup of FF, FH, HF 
and HH, we plot the ROG curves of (DIG picked) hmmSeq, edgeR, baySeq 
and TSPM averaged over 30 repetitions. Figure 3 displays the ROG curves 
for the methods hmmSeq (solid line), edgeR (dashed line), baySeq (dotted 
line) and TSPM (dot-dashed line) with the areas below the ROG curves 
being indicative of the relative accuracies of the methods in detecting DE 
genes. While edgeR beat the methods TSPM and baySeq in this simulation, 
hmmSeq achieves a substantially higher area under the ROG curve than the 
competing methods. 

Eor each of the four competing methods, Eigure 4 plots the observed FDR 
against the nominal FDR. Ideally, we would like to observe a 45 degree line 
through the origin in Figure 4 for each method. Observed FDR of edgeR is 
substantially smaller than the nominal FDR. The observed FDR for TSPM 
and baySeq, on the other hand, are quite liberal: The FDR for TSPM always 
exceeds 40%, while the FDR for baySeq exceeds 35% for most values of 
nominal FDR. Finally, FDR for hmmSeq is near and slightly lower than 
the 45 degree line. Therefore, hmmSeq is the method that performs best at 
controling FDR. 
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ROC curves 


ROC curves 



(a) 



(b) 


ROC curves 


ROC curves 



(c) 
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- hmmSeq 

— edgeR 
baySeq 
- TSPM 


0.4 0.6 

1-Specificity 


(d) 


0.8 


Fig. 3. For the simulation study of four setups FF (a), FH (b), HF (c) and HFl (d), 
four panels depict the ROC curves for hmmSeq, edgeR, baySeq and TSPM. Results are 
averaged over 30 simulated data sets for each setup, where for each simulated data set we 
use the DIC-chosen hmmSeq model. 


To investigate the robustness of hmmSeq to overdispersed data, we simu¬ 
lated RNA-seq counts from a negative binomial distribution. This distribu¬ 
tion is assumed by both edgeR and baySeq. Assume that y\X ~ Poisson(A) 
and A|r,?/i ~ gamma(r, (1 — Then, unconditionally, we obtain y|r,'0 ~ 

negative binomial(r,';/')• The negative binomial mean is = riltjiX — ijj) 
and the variance is = r?/i/(l — '(/i)^. The variance + m W/r) 

exceeds the mean reflecting overdispersion; (^ = 1/r is usually called 

the dispersion parameter. 
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FDR control plot 


FDR control plot 


K 

Q 


■o 

41 

t 




(a) 


(b) 


FDR control plot 


FDR control plot 




(c) 


(d) 


Fig. 4. For the simulation study, four panels depict the observed FDR versus nominal 
FDR for the methods hmmSeq, TSPM, edgeR and haySeq under four different simulation 
setups FF (a), FH (b), HF (c) and HH (d). Results are averaged over 30 simulated data 
sets for each setup, where for each simulated data set we use the DIC-chosen hmmSeq 
model. The proposed method controls the FDR closest and slightly lower to the f5 degree 
line. 


The gene-specific magnitude factors /3 and the differential expression fac¬ 
tors A were generated from a hidden Markov model with parameters values 
given in Tables 1 and 2. To generate the negative binomial reads for each 
gene i, we first generated the mean = Xijkc by equation (2.1) and the 
dispersion parameter Q from a gamma distribution [as in, e.g., Kvam, Liu 
and Si (2012)]. To mimic the dispersion observed in real data sets, the shape 
parameter and scale parameter of the gamma distribution were estimated 
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ROC curves FDR control plot 




1-Specificity Nominal FDR 

(a) (b) 


Fig. 5. For the negative binomial simulation study, the left panel depicts the ROC curves 
for DIC-selected hmmSeq, edgeR, baySeq and TSPM, and the right panel depicts the ob¬ 
served FDR versus nominal FDR for DIC-selected hmmSeq, TSPM, edgeR and baySeq. 
(a) ROC curves, (b) FDR control plot. 


by the method of moments using the gene-wise dispersion estimates of Zeng 
et al. (2012) (biological replicates) available from edgeR. We computed the 
gamma distribution parameters r = 1/C, and tp = C,p/{1 + and then hi¬ 
erarchically generated negative binomial read counts for 12 chromosomes 
having 800 genes each. This simulation procedure was replicated 30 times. 

We ht four hmmSeq models to the negative binomial data. In addition, 
we fit edgeR, baySeq and TSPM models to the data. DIG chose the true 
model (HH in this case) 19 out of 30 times. The ROC curves and FDR 
controls are plotted in Figure 5. In the FDR control plot in Figure 5(b), 
we find that none of the methods are accurate. The hmmSeq FDR tends 
to be large for small nominal FDR, converging to the 45 degree line as the 
nominal level increases. In contrast, the observed FDR of baySeq and edgeR 
are mostly lower than the nominal FDR. For the ROC plot in Figure 5(a), 
hmmSeq achieves the highest area under the ROC curve than the competing 
methods, demonstrating its high reliability in detecting DE genes. 

5. Data analysis. To illustrate the power and flexibility of our proposed 
RNA-seq analysis method, we have applied the hmmSeq method to analyze 
three data sets: Marioni et al. (2008) (technical replicates), Zeng et al. (2012) 
(biological replicates) and Henn et al. (2013) (paired data). For each of 
these three data sets, the treatment-specific replicate effects pjk are obtained 
by the upper-quartile normalizing technique of Bullard et al. (2010). In 
addition, we compare the results of the hmmSeq analysis with results of 
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TSPM [Auer and Doerge (2011)], baySeq [Hardcastle (2009)] and edgeR 
[Robinson, McCarthy and Smyth (2010)] based on their publicly available 
R package implementations. 

5.1. Marioni et al. (2008) data set. The Marioni et al. (2008) data set 
contains RNA-seq data for five technical replicates each of a single sample of 
kidney RNA (treatment 1) and liver RNA (treatment 2). Because genes with 
mostly small counts are not informative about differential expression, we 
have applied the filtering criterion of Auer and Doerge (2011) to eliminate 
the genes whose total read counts were less than 10. Additionally, the Y 
chromosome was ignored because many of its genes are transcribed on other 
chromosomes and the genders of the subjects are unknown. This yielded 
17,076 genes for the analysis. Further, we applied the quantile normalization 
of Bullard et al. (2010) to preprocess the data. We have fitted four hmmSeq 
models (FF, FH, HF and HH) to the data. The DIG favors the FH model 
as the best, which implies neighboring genes dependence with respect to the 
differential expression parameter A. Thus, in the remainder of this section 
we present hmmSeq results based on the FH model. 

We have applied the hmmSeq, edgeR, baySeq and TSPM methods to the 
Marioni et al. (2008) data set with a nominal FDR of g'o = 0.001 [threshold 
adopted by Auer and Doerge (2011)]. Recall that the simulation study in 
Section 4 had indicated that the actual FDR of TSPM and baySeq is rel¬ 
atively insensitive to the value of qo and is substantially higher when qq is 
small. The sets of DE genes identified by the methods hmmSeq, edgeR, bay¬ 
Seq and TSPM are summarized in Figure 6(a). The method TSPM detected 
9076 DE genes. In contrast, the hmmSeq method discovered only 2831 DE 
genes. 

A closer examination sheds light on the differing sets of genes detected 
by TSPM and hmmSeq. Of the genes discovered by hmmSeq, as many as 
2818 genes (99.5%) were also identified by the TSPM method. Focusing on 
the 6258 genes identified as DE by TSPM but not by hmmSeq, we find that 
TSPM flagged most of them as DE genes because they have extreme values 
of mean log 2 -fold change (from output of TSPM) for the treatments. In 
particular, we have observed that for the 111 genes with log 2 -fold change less 
than —20, all five gene-specific reads for liver RNA were zero. Table 4 lists 
10 randomly selected genes from this set, which reveals that the read counts 
for kidney, although positive, are not in hundreds or thousands like typical 
DE genes. Thus, TSPM tends to classify genes with 0 observations under 
any single condition as DE, while hmmSeq takes the variational magnitude 
into consideration. The result for edgeR lies between hmmSeq and TSPM. 

Similarly, all 33 genes with mean log 2 -fold change greater than 20 have 
zero counts for kidney and relatively small counts for liver. The hmmSeq 
method called all of them non-DE, but TSPM classihed them as DE genes 
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baySeq edgeR baySeq edgeR 




baySeq edgeR 



Eig. 6. Venn diagrams for the DE genes identified by the methods hmmSeq, baySeq, 
edgeR and TSPM in the data analyses, (a) Technieal replicates of Marioni et al. (2008), 
(b) biological replicates of Zeng et al. (2012), (c) paired biological replicates of Henn et al. 
(2013). 


due to their high mean log 2 -fold changes. The former call seems more rea¬ 
sonable, given that the read counts of truly DE genes are typically several 
orders of magnitude higher. 

5.2. Zeng et al. (2012) data set. We have applied hmmSeq to the data 
set of Zeng et al. (2012), which contains samples from 2 regions of the 
human brain, frontal pole and hippocampus, with 6 biological replicates 
for each region. Data sets with biological replicates typically exhibit sub¬ 
stantial over-dispersion or higher variability relative to the Poisson distri¬ 
bution. Researchers often model extra-Poisson variability using the bino¬ 
mial, negative binomial or Bayesian hierarchical Poisson models. We ac¬ 
count for over-dispersion through hierarchical priors on the parameters in 
equation (2.1), for example, through random differential expression factors. 
There were 13,574 genes available for analysis after filtering (read sums for 
all the libraries did not exceed 9). Further, we have applied the quantile 


Table 4 

Marioni et al. (2008) data set—ten randomly selected genes with mean \ 0 g 2 -f 0 ld change greater than 20 that were identified as DE genes 
by the TSPM method. RjLkK denotes jth run, kth replicate for kidney; RjLkL denotes jth run, kth replicate for liver 


Gene ID 

RiLiK 

R1L3K 

R1L7K 

R2L2K 

R 2 L 6 K 

R,iLj2L 

R,iL4L 

R-iLbL 

RiLsL 

R, 2 L 3 L 

ENSG00000198693 

2 

1 

2 

2 

8 

0 

0 

0 

0 

0 

ENSG00000162746 

3 

4 

2 

4 

4 

0 

0 

0 

0 

0 

ENSG00000168243 

4 

4 

1 

3 

3 

0 

0 

0 

0 

0 

ENSG00000188935 

2 

0 

3 

5 

5 

0 

0 

0 

0 

0 

ENSG00000173284 

5 

2 

5 

2 

3 

0 

0 

0 

0 

0 

ENSG00000114113 

4 

5 

4 

1 

1 

0 

0 

0 

0 

0 

ENSG00000169836 

5 

2 

3 

2 

3 

0 

0 

0 

0 

0 

ENSG00000170180 

2 

2 

3 

2 

5 

0 

0 

0 

0 

0 

ENSG00000186952 

3 

4 

2 

4 

2 

0 

0 

0 

0 

0 

ENSG00000164385 

3 

2 

3 

3 

4 

0 

0 

0 

0 

0 
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normalization of Bullard et al. (2010) to preprocess the data. We have fit¬ 
ted four hmmSeq models (FF, FH, HF and HH) to the Zeng et al. (2012) 
data set. DIG chooses HH as the best model, which indicates neighboring 
genes dependence with respect both to the differential expression parameter 
A and to the expression level parameter /3. Thus, in the remainder of this 
section we present hmmSeq results based on the HH model. 

We have applied the hmmSeq, edgeR, baySeq and TSPM methods to 
the data set of Zeng et al. (2012) with a nominal FDR qq = 0.05. TSPM, 
edgeR and baySeq, respectively, identified 236, 134 and 278 DE genes. The 
hmmSeq technique identified 333 DE genes. The overlapping set of DE genes 
for the methods are summarized in Figure 6(b) and reveal a greater lack of 
agreement between the methods than for the Marioni et al. (2008) data 
set. Only 1 gene is identified as DE by all four methods. This low level 
of agreement is a result of the low overlap that TSPM has with the other 
methods. In contrast, hmmSeq has relatively large overlap both with edgeR 
(76 genes) and baySeq (110 genes). 

We have investigated the biological implications of the results obtained 
with the hmmSeq analysis of differential expression of the hippocampus to 
the frontal pole. Though we expect a modest amount of differentially ex¬ 
pressed genes, we do find some meaningful results that are supported in the 
literature. There is an increase in gene expression of Akt2 in the hippocam¬ 
pus compared to the frontal pole. Akt2 is a gene involved in insulin signal¬ 
ing, which occurs in the hippocampus [Robertson et al. (2010), Agrawal and 
Gomez-Pinilla (2012)]. In addition, Wnt7B is upregulated in the hippocam¬ 
pus where Wnt activity has been implicated in signaling of hippocampal 
synapses [Gogolla et al. (2009)]. Last, STAT5A is known to be expressed in 
the hippocampus [Kalita et al. (2013)], and our results show this upregu- 
lation. Taken all together, the results our hmmSeq method produced show 
biologically relevant genes when comparing the hippocampus to the frontal 
pole. In addition, we have conducted a functional analysis using DAVID 
[Huang, Sherman and Lempicki (2009a, 2009b)]; both our hmmSeq method 
and edgeR identified differentially expressed genes that are known to be ex¬ 
pressed in brain tissue. The biological experiment presented tries to identify 
genes that are differentially expressed between two types of brain tissue. 
These results, taken in perspective with the biological experiment, suggest 
that the genes identified as differentially expressed via these two methods 
are relevant to the biological problem and help support the validity and 
accuracy of our predictions. 

5.3. Henn et al. (2013) paired data set. Here we illustrate the application 
of the hmmSeq method to paired data sets with an analysis of a subset of 
an RNA-seq data set obtained by Henn et al. (2013) on immune response 
to a trivalent influenza vaccine. The original data set contains RNA-seq 
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data from B cell samples for five subjects before vaccination (day 0) and 
for each of 10 days after vaccination (days 1 through 10). We consider a 
paired subset of three previously vaccinated subjects in the original data 
set, where we compare gene expression before vaccination to gene expression 
after vaccination. Since peak B cell response usually appears 5-9 days post¬ 
vaccination, we apply our hmmSeq method to identify B cell gene differential 
expression between day 0 and day 7. 

For the hmmSeq analysis, we first estimate the variance cr^ of subject- 
specific random effects from the data. Specifically, for each gene we have 
fitted a generalized linear mixed model with random subject effects, resulting 
in an estimated random effects variance for each gene. We then use the 
median of these estimates of random effects variances as an empirical Bayes 
estimate of cr^. In addition, we have fitted the four hmmSeq models FF, 
FH, HF and HH. We have found that the DIG favors the FF model, that 
is, a finite mixture model without neighboring genes dependence. Thus, in 
the remainder of this section we present hmmSeq results based on the FF 
model. 

We have analyzed this immune response data set using a nominal FDR 
of 0.05. To accommodate the paired data structure, in the edgeR analysis 
we include subject-specific fixed effects. Such edgeR analysis identifies 175 
genes as differentially expressed. The TSPM that we used ignores the paired 
structure and treats all observations for each gene as independent, which 
identifies a total of 186 genes. Finally, in a paired baySeq analysis 100 genes 
are flagged to be DE. Figure 6(c) presents a Venn diagram that summarizes 
the results for TSPM, edgeR, baySeq and hmmSeq. 

In order to further evaluate the competing methods, we compare their 
results to those found by Henn et al. (2013). Henn et al. (2013) used the 
RNA-seq data set from all 11 days, whereas we used only the data from 
days 0 and 7. Thus, here we use the results of Henn et al. (2013) as a 
benchmark. Specifically, Henn et al. (2013) identified a set of 742 genes as 
what they call the plasma cell gene signature (PCgs), that is, genes that have 
a common significant time-varying signature. Hence, in Table 5 we list the 
overlap of the PCgs set with the genes identified as differentially expressed 
by hmmSeq, edgeR, baySeq and TSPM. Our proposed hmmSeq method 
obtains the largest overlap with PCgs set (130 genes), and edgeR overlaps 


Table 5 

Henn et al. (2013) data set—overlap of plasma cell gene signature (PCgs) set with genes 
identified by hmmSeq, edgeR, baySeq and TSPM 



hmmSeq 

edgeR 

baySeq 

TSPM 

PCgs 

130 

121 

7 

1 
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121 genes with PCgs. We recall from Section 4 that hmmSeq and edgeR 
are the two methods with the highest area under the ROC curve. Thus, the 
overlap with the PCgs set shows the power of DE genes identification of the 
proposed hmmSeq method. 

6. Conclusion. We propose hmmSeq, a method based on Bayesian hier¬ 
archical models for detecting DE genes between two treatments for paired or 
nonpaired data in RNA-seq analyses. The approach employs hidden Markov 
models to account for the statistical dependence between the gene counts 
of neighboring genes observed in many RNA-seq data sets. The hmmSeq 
model can be applied to studies with either biological or technical repli¬ 
cates, automatically adjusting for any overdispersion relative to the Poisson 
distribution. Through simulated and real data sets, we compare and con¬ 
trast the performance of hmmSeq with some well-known methods in the 
literature, demonstrating the reliability and success of our approach in the 
identification of DE genes. 

We have developed DIC-based model selection to decide for each data set 
whether HMM or EMM should be used to model gene expression magnitude 
and/or differential expression. For the Marioni et al. (2008) data set the 
DIC-selected model is the FH model, for the Zeng et al. (2012) data set 
the DIC-selected model is the HH model, and for the Henn et al. (2013) 
data set the DIC-selected model is the FF model. Thus, for one data set 
there appears to be neighboring genes dependence in expression magnitude. 
Even more important, for two data sets there is evidence of neighboring 
genes dependence in differential expression. This co-differential expression 
can be justified by how ancient species organized their genomes and by 
evolution. Specifically, more ancient species, such as bacteria, organize their 
genomes based on operons, where genes involved in the same process or 
needed at the same time are transcribed in tandem [Alberts et al. (1994)]. 
Throughout evolution, operons have been divided into individual genes, but 
genes involved in the same process now reside in gene clusters [Hurst, Pal 
and Lercher (2004)]. Thus, neighboring genes tend to be jointly differentially 
expressed. 

To further examine spatial genomic dependence (and clustering) among 
detected DE genes, we have devised the following statistical test. Consider 
any detected DE gene and the next detected DE gene in the chromosome as 
neighboring DE genes. Consider the distance between two neighboring DE 
genes as the number of non-DE genes between them. If there is no spatial 
dependence, then all the distances between any two neighboring DE genes 
should be a random sample from a geometric distribution. Hence, to test 
for spatial dependence, we collect all the distances between neighboring DE 
genes and conduct a goodness-of-fit test of the hypothesis that the empirical 
distribution equals the null theoretical geometric distribution. We use this 
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procedure to test for spatial genomic dependence for DE calls from edgeR 
and hmmSeq. First, we performed this test for the Henn et al. (2013) data set 
for which hmmSeq prefers finite mixture model and spatial independence. 
The spatial genomic dependence test for DE calls from edgeR and hmmSeq 
yields p-values equal to 0.2201 and 0.5178, respectively, further supporting 
hmmSeq suggestion of spatial independence. Second, we performed this test 
for the two real data sets for which hmmSeq prefers spatial dependence, 
that is, the Marioni et al. (2008) and the Zeng et al. (2012) data sets. For 
the Marioni et al. (2008) data set, the spatial dependence test for DE calls 
from both hmmSeq and edgeR yield p-values smaller than 2.2e-16. That is, 
even though edgeR does not account for spatial dependence, its detected 
DE genes for the Marioni et al. (2008) data set cluster spatially. For the 
Zeng et al. (2012) data set, edgeR only detected 134 DE genes which did 
not provide enough power for the goodness-of-fit data set (p-value greater 
than 0.9). In contrast, there is strong statistical evidence that the 333 genes 
identified by hmmSeq as DE cluster spatially (p-value smaller than 2.2e- 
16). Therefore, these data sets point to the need to consider spatial genomic 
dependence in studies of differential gene expression. 

In addition to genomic spatial dependence among genes based on genomic 
position, for future research work we plan to extend hmmSeq to include 
other sources of dependence among genes. Recent experimental techniques 
such as HiC and ChlA-PET allow for the identihcation of explicit promoter- 
promoter and promoter-enhancer-promoter interactions [Edelman and Fraser 
(2012), Mercer and Mattick (2013), van Arensbergen, van Steensel and 
Bussemaker (2014)]. In addition, we note that genes that belong to the 
same active functional pathways tend to be co-expressed [Tegge, Caldwell 
and Xu (2012)]. This extension of hmmSeq may need a non-Markovian spa¬ 
tial correlation model. We leave this challenging inferential problem to future 
research. 

DE gene call lists are frequently used in downstream pathway function 
calls in what is known as functional enrichment analysis. Because functional 
enrichment analysis methods usually assume independence of DE gene calls, 
caution needs to be taken when using the DE gene call lists generated by 
hmmSeq. When hmmSeq decides that the best model is a finite mixture 
model without spatial dependence, then one can use hmmSeq’s DE gene 
call list without any concern. However, when hmmSeq decides that a spa¬ 
tial dependence model is warranted, then the assumption of independence 
no longer holds. This opens up a tremendous opportunity for future re¬ 
search that performs joint differential expression gene calls and functional 
enrichment analysis. We envision this joint analysis may be implemented 
by extending hmmSeq to incorporate information on functional pathway 
networks. 
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In terms of computational time, on a desktop with a 2.3 GHz proces¬ 
sor and 4 GB memory, hmmSeq takes approximately 3 hours to analyze 
a 1200-gene chromosome. Although it does take a longer time than other 
methods, hmmSeq often achieves a higher accuracy of DE gene detection 
than other methods by a realistic model that allows for spatial genomic de¬ 
pendence. Moreover, the computational time of all the considered statistical 
methods is negligible when compared to the time (on the order of months 
or years) required by subject-matter scientists to perform experiments to 
obtain RNA-seq data. Furthermore, to limit the computational time the 
hmmSeq analysis can be performed in parallel for individual chromosomes. 
Finally, when compared to the high costs of RNA-seq extraction, the infor¬ 
mation gains obtained by the hmmSeq methodology seem well worth the 
relatively low computational costs. 

The hmmSeq method we propose relies on a single user-specified “tun¬ 
ing” parameter qq, that is, the nominal false discovery rate. A default value 
for qo between 0.001 or 0.05 has produced satisfactory results for all the 
data sets, real or simulated, that we have analyzed, facilitating “black box” 
applications of hmmSeq. Future work will focus on extending hmmSeq to 
investigations with more than two treatments. An R package implementing 
the hmmSeq framework will be submitted to GRAN upon publication of the 
manuscript. 
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SUPPLEMENTARY MATERIAL 

Supplement to “hmmSeq: A hidden Markov model for detecting differ¬ 
entially expressed genes from RNA-seq data.” 

(DOL 10.1214/15-AOAS815SUPP; .zip). The R code for hmmSeq. 
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